A survey of using EHR as real-world evidence for discovering and validating new drug indications

Talukdar, Nabasmita, Zhang, Xiaodan, Paithankar, Shreya, Wang, Hui, Chen, Bin

arXiv.org Artificial Intelligence

Electronic Health Records (EHRs) have been increasingly used as real-world evidence (RWE) to support the discovery and validation of new drug indications. This paper surveys current approaches to EHR-based drug repurposing, covering data sources, processing methodologies, and representation techniques. It examines study designs and statistical frameworks for evaluating drug efficacy, and highlights key challenges in validation, with emphasis on the role of large language models (LLMs) and target trial emulation. By synthesizing recent developments and methodological advances, this work provides a foundational resource for researchers aiming to translate real-world data into actionable drug-repurposing evidence.
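
As a toy illustration of the kind of confounder-adjusted comparison that target trial emulation formalizes, the following Python sketch computes a stratified risk difference on fabricated data; the cohort, strata, and weighting scheme are our own invented stand-ins, not methods from any surveyed study.

```python
# Minimal synthetic sketch: a stratified risk-difference estimate of the kind
# that EHR-based target trial emulation formalizes. All data are fabricated.
from collections import defaultdict

# Each record: (treated?, outcome occurred?, confounder stratum).
cohort = [
    (True,  False, "young"), (True,  True,  "old"),
    (True,  False, "young"), (True,  False, "old"),
    (False, True,  "old"),   (False, False, "young"),
    (False, True,  "old"),   (False, False, "young"),
]

# Group outcomes by confounder stratum and treatment arm.
strata = defaultdict(lambda: {"treated": [], "control": []})
for treated, outcome, stratum in cohort:
    arm = "treated" if treated else "control"
    strata[stratum][arm].append(outcome)

def risk(outcomes):
    """Observed event rate within one arm of one stratum."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Stratum-specific risk differences, weighted by stratum size
# (a crude stand-in for propensity-score adjustment).
total = len(cohort)
adjusted_rd = sum(
    (risk(arms["treated"]) - risk(arms["control"]))
    * (len(arms["treated"]) + len(arms["control"])) / total
    for arms in strata.values()
)
print(f"Confounder-adjusted risk difference: {adjusted_rd:+.3f}")
```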


I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations

Kharchenko, Julia, Roosta, Tanya, Chadha, Aman, Shah, Chirag

arXiv.org Artificial Intelligence

This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark's effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.
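
To make the measurement concrete, here is a minimal Python sketch of the benchmark's bookkeeping: generate hedged variants of a response with unchanged content, score each variant, and compare mean ratings. The hedge markers and the `rate_response` scorer are hypothetical stand-ins; in the actual benchmark the scoring is done by the LLM under evaluation.

```python
# Illustrative sketch: score semantically equivalent hedged vs. direct
# responses and report the mean rating gap between them.
HEDGES = ["I think ", "perhaps ", "it seems that "]

def hedge(response: str, marker: str) -> str:
    """Produce a hedged variant with the content left unchanged."""
    return marker + response[0].lower() + response[1:]

def rate_response(response: str) -> float:
    """Placeholder scorer; in the paper this role is played by the LLM under
    evaluation. Here we penalize hedging words only to show the bookkeeping,
    not to reproduce any real model's behavior."""
    penalty = sum(m.strip() in response for m in HEDGES)
    return 8.0 - 2.0 * penalty

pairs = []
for answer in ["I led the migration of our billing system to Kubernetes.",
               "I reduced query latency by rewriting the index layer."]:
    direct = rate_response(answer)
    hedged = sum(rate_response(hedge(answer, m)) for m in HEDGES) / len(HEDGES)
    pairs.append((direct, hedged))

gap = sum(d - h for d, h in pairs) / len(pairs)
print(f"Mean rating penalty for hedged variants: {gap:.2f} points")
```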


'The vehicle suddenly accelerated with our baby in it': the terrifying truth about why Tesla's cars keep crashing

The Guardian

It was a Monday afternoon in June 2023 when Rita Meier, 45, joined us for a video call. Meier told us about the last time she said goodbye to her husband, Stefan, five years earlier. He had been leaving their home near Lake Constance, Germany, heading for a trade fair in Milan. Meier recalled how he hesitated between taking his Tesla Model S or her BMW. He had never driven the Tesla that far before. He checked the route for charging stations along the way and ultimately decided to try it. Rita had a bad feeling. She stayed home with their three children, the youngest less than a year old. At 3.18pm on 10 May 2018, Stefan Meier lost control of his Model S on the A2 highway near the Monte Ceneri tunnel. "The collision with the guardrail launches the vehicle into the air, where it flips several times before landing," investigators would write later. The car came to rest more than 70 metres away, on the opposite side of the road, leaving a trail of wreckage. Several passersby tried to open the doors and rescue the driver, but they couldn't unlock the car. When they heard explosions and saw flames through the windows, they retreated. Even the firefighters, who arrived 20 minutes later, could do nothing but watch the Tesla burn.


Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data

Wu, Xue, Tsioutsiouliklis, Kostas

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, they often struggle with complex reasoning tasks and are prone to hallucination. Recent research has shown promising results in leveraging knowledge graphs (KGs) to enhance LLM performance. KGs provide a structured representation of entities and their relationships, offering a rich source of information that can enhance the reasoning capabilities of LLMs. In this work, we develop several techniques that tightly integrate KG structures and semantics into LLM representations. Our results show that we can significantly improve the performance of LLMs in complex reasoning scenarios and ground the reasoning process in KGs. We are the first to represent KGs in a programming language and to fine-tune pretrained LLMs on them. This integration facilitates more accurate and interpretable reasoning processes, paving the way for more advanced reasoning capabilities in LLMs.
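
The core idea, as we read it, of representing KG triples in a programming language can be sketched as follows; the `Entity` and `Relation` classes and the `link(...)` statement rendered into the output text are our own illustrative naming, not the paper's released code.

```python
# Hedged sketch: serialize knowledge graph structure as code so an LLM sees
# typed entities and relations rather than flat triples.
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    entity_type: str

@dataclass
class Relation:
    head: Entity
    relation: str
    tail: Entity

    def to_code(self) -> str:
        """Render the triple as executable-looking statements.
        `link` is illustrative corpus text, not a function defined here."""
        return (f'{self.head.name.lower()} = Entity("{self.head.name}", "{self.head.entity_type}")\n'
                f'{self.tail.name.lower()} = Entity("{self.tail.name}", "{self.tail.entity_type}")\n'
                f'link({self.head.name.lower()}, "{self.relation}", {self.tail.name.lower()})')

kg = [Relation(Entity("Aspirin", "Drug"), "treats", Entity("Headache", "Condition"))]
corpus = "\n\n".join(r.to_code() for r in kg)
print(corpus)  # text that could be mixed into a fine-tuning corpus
```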


Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators

Wei, Timothy, Miin, Annabelle, Miin, Anastasia

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications. However, due to the current text-in-text-out paradigm, it remains challenging for LLMs to handle dynamic and complex application constraints, let alone devise general solutions that meet predefined system goals. Current common practices like model finetuning and reflection-based reasoning often address these issues case-by-case, limiting their generalizability. To address this issue, we propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics by collaborating with human experts.

Those methods usually rely on data curation reflecting on the deliberate reasoning path in specific application areas. When it comes to complex application constraints, high-quality solutions often demand a large volume of data to cover enough data cases and the corresponding reasoning logic. This process fundamentally differs from the typical human cognition process: faced with unfamiliar problems, people first seek to capture the overview of the underlying application constraints, which are potential rules summarized from observations. Next, these learned rules will be further refined when exception cases arise. The essence of this cognitive process lies in distilling rules and identifying minimal cases for refinement rather than depending on the inefficient ...
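
A minimal Python sketch of this propose-check-refine cycle, under our own assumptions, might look like the following: the model proposes a candidate rule, a system interface flags exception cases, and a human-in-the-loop discriminator refines the rule only on those minimal failing cases. `propose_rule`, `human_refine`, and `system_interface` are hypothetical stand-ins, not the paper's code.

```python
def propose_rule():
    """Stand-in for an LLM summarizing a constraint from observations."""
    return lambda plan: plan["cost"] <= 1000  # first guess: flat budget cap

def human_refine(rule, exception):
    """Stand-in for a human expert tightening the rule on a failing case."""
    return lambda plan: plan["cost"] <= plan["budget"]  # per-plan budget

def system_interface(plan):
    """Ground-truth check exposed by the application (hard-coded here)."""
    return plan["cost"] <= plan["budget"]

cases = [
    {"cost": 800, "budget": 1000},
    {"cost": 950, "budget": 900},  # exception: passes flat cap, violates budget
]

rule = propose_rule()
for plan in cases:
    if rule(plan) != system_interface(plan):
        rule = human_refine(rule, plan)  # refine only on minimal failing cases
print(all(rule(p) == system_interface(p) for p in cases))  # True after refinement
```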


How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions

Kharchenko, Julia, Roosta, Tanya, Chadha, Aman, Shah, Chirag

arXiv.org Artificial Intelligence

Large Language Models (LLMs) attempt to imitate human behavior by responding to humans in ways that please them, including by adhering to their values. However, humans come from diverse cultures with different values. It is therefore critical to understand whether LLMs showcase different values to users based on the stereotypical values of a user's known country. We prompt different LLMs with a series of advice requests based on the 5 Hofstede Cultural Dimensions, a quantifiable way of representing the values of a country. In each prompt, we incorporate personas representing 36 different countries and, separately, languages predominantly tied to each country, to analyze the consistency of the LLMs' cultural understanding. Through our analysis of the responses, we find that LLMs can differentiate between one side of a value and another and understand that countries have differing values, but do not always uphold those values when giving advice, and fail to recognize the need to answer differently based on different cultural values. Rooted in these findings, we present recommendations for training value-aligned and culturally sensitive LLMs. More importantly, the methodology and framework developed here can help further understand and mitigate culture and language alignment issues with LLMs.
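
The prompt-construction step described above is straightforward to sketch in Python; the dimension phrasings and country list below are toy examples of our own, not the paper's released prompts.

```python
# Illustrative prompt construction in the spirit of the paper's setup:
# country personas crossed with advice requests targeting Hofstede dimensions.
DIMENSIONS = {
    "individualism": "Should I prioritize my own career goals over my family's wishes?",
    "power_distance": "Should I openly challenge my manager's decision in a meeting?",
}
COUNTRIES = ["Japan", "United States", "Brazil"]

def build_prompts():
    """Yield one persona-conditioned advice prompt per (country, dimension)."""
    for country in COUNTRIES:
        for dim, question in DIMENSIONS.items():
            yield dim, country, f"You are advising someone from {country}. {question}"

for dim, country, prompt in build_prompts():
    print(f"[{dim} | {country}] {prompt}")
```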


Developing a Framework for Auditing Large Language Models Using Human-in-the-Loop

Amirizaniani, Maryam, Yao, Jihan, Lavergne, Adrian, Okada, Elizabeth Snell, Chadha, Aman, Roosta, Tanya, Shah, Chirag

arXiv.org Artificial Intelligence

As LLMs become more pervasive across various users and scenarios, identifying potential issues when using these models becomes essential. Examples include bias, inconsistencies, and hallucination. Although auditing LLMs for these problems is desirable, it is far from easy or solved. An effective method is to probe the LLM with different versions of the same question, which can expose inconsistencies in its knowledge or operation, indicating potential bias or hallucination. However, to operationalize this auditing method at scale, we need an approach to create those probes reliably and automatically. In this paper, we propose an automatic and scalable solution in which a different LLM is used along with humans in the loop. This approach offers verifiability and transparency, avoids circular reliance on the same LLM, and increases scientific rigor and generalizability. Specifically, we present a novel methodology with two phases of human verification: standardized evaluation criteria to verify responses, and a structured prompt template to generate the desired probes. Experiments on a set of questions from the TruthfulQA dataset show that we can generate a reliable set of probes from one LLM that can be used to audit inconsistencies in a different LLM. The criteria for generating and applying auditing probes are generalizable to various LLMs regardless of the underlying structure or training mechanism.
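
The two-phase pipeline might be wired together as in the following Python sketch; `generator_llm`, `human_verified`, and `audited_llm` are hypothetical stubs standing in for the probe-generating LLM, the human verification step, and the audited LLM, respectively.

```python
# Sketch of the two-phase probe pipeline as we understand it; only the
# control flow is illustrated, with all LLM calls stubbed out.
PROBE_TEMPLATE = ("Rewrite the question below in {n} semantically equivalent "
                  "ways, changing wording but not meaning.\nQuestion: {q}")

def generator_llm(prompt: str) -> list[str]:
    """Stand-in for the probe-generating LLM."""
    return ["What occurs if you eat watermelon seeds?",
            "Is swallowing watermelon seeds harmful?"]

def human_verified(probes: list[str]) -> list[str]:
    """Phase-one human check: keep only probes judged equivalent (stubbed)."""
    return probes

def audited_llm(question: str) -> str:
    """Stand-in for the different LLM being audited."""
    return "Nothing harmful happens; the seeds pass through your system."

question = "What happens if you eat watermelon seeds?"  # TruthfulQA-style item
probes = human_verified(generator_llm(PROBE_TEMPLATE.format(n=2, q=question)))
answers = {p: audited_llm(p) for p in probes}
# Phase two: humans apply standardized criteria to flag divergent answers.
print(len(set(answers.values())), "distinct answer(s) across probes")
```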


AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach

Amirizaniani, Maryam, Roosta, Tanya, Chadha, Aman, Shah, Chirag

arXiv.org Artificial Intelligence

As Large Language Models (LLMs) gain wider adoption in various contexts, it becomes crucial to ensure they are reasonably safe, consistent, and reliable for the application at hand. This may require probing or auditing them. Probing an LLM with varied iterations of a single question can reveal potential inconsistencies in its knowledge or functionality. However, a tool for performing such audits with a simple workflow and a low technical threshold has been lacking. In this demo, we introduce "AuditLLM," a novel tool designed to evaluate the performance of various LLMs in a methodical way. AuditLLM's core functionality lies in its ability to audit a given LLM using multiple probes generated from a single question, thereby identifying any inconsistencies in the model's understanding or operation. A reasonably robust, reliable, and consistent LLM should output semantically similar responses to a question asked in different ways or by different people. Based on this assumption, AuditLLM produces easily interpretable results regarding the LLM's consistency from a single question entered by the user. A certain level of inconsistency has been shown to be an indicator of potential bias, hallucination, and other issues, so the output of AuditLLM can be used to investigate the audited LLM further. To facilitate demonstration and practical use, AuditLLM offers two key modes: (1) a live mode, which allows instant auditing of an LLM by analyzing its responses to real-time queries, and (2) a batch mode, which facilitates comprehensive auditing by processing multiple queries at once for in-depth analysis. The tool benefits both researchers and general users by enhancing our understanding of LLMs' response-generation capabilities through a standardized auditing platform.
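
One plausible way to compute such a consistency score is shown below; token-overlap Jaccard similarity is our own simple stand-in for whatever semantic similarity measure AuditLLM actually uses.

```python
# How a multiprobe consistency score might be computed: average pairwise
# similarity across the responses elicited by different probes.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude proxy for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

responses = [
    "The Eiffel Tower is in Paris, France.",
    "You can find the Eiffel Tower in Paris.",
    "It is located in Berlin.",  # an inconsistent outlier
]

scores = [jaccard(a, b) for a, b in combinations(responses, 2)]
consistency = sum(scores) / len(scores)
print(f"Mean pairwise consistency: {consistency:.2f}")  # low values flag issues
```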


System 2 Attention (is something you might need too)

Weston, Jason, Sukhbaatar, Sainbayar

arXiv.org Artificial Intelligence

Soft attention in Transformer-based Large Language Models (LLMs) is susceptible to incorporating irrelevant information from the context into its latent representations, which adversely affects next-token generation. To help rectify these issues, we introduce System 2 Attention (S2A), which leverages the ability of LLMs to reason in natural language and follow instructions in order to decide what to attend to. S2A regenerates the input context to include only the relevant portions, then attends to the regenerated context to elicit the final response. In experiments, S2A outperforms standard attention-based LLMs on three tasks containing opinionated or irrelevant information: QA, math word problems, and long-form generation, where S2A increases factuality and objectivity and decreases sycophancy.
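
The two-stage procedure can be sketched directly from this description; `llm` below is a placeholder completion function (a canned stub so the example runs end-to-end), not any specific model API, and the exact prompt wording is our own.

```python
def llm(prompt: str) -> str:
    """Stand-in completion so the sketch runs without a real model client."""
    if prompt.startswith("Extract"):
        return "My textbook says the capital of France is Paris."
    return "Paris."

def s2a_answer(context: str, question: str) -> str:
    # Stage 1: regenerate the context, keeping only relevant, objective parts.
    cleaned = llm(
        "Extract only the parts of the following text that are relevant and "
        "objective for answering the question; drop opinions and asides.\n"
        f"Text: {context}\nQuestion: {question}"
    )
    # Stage 2: answer from the regenerated context alone.
    return llm(f"Context: {cleaned}\nQuestion: {question}\nAnswer:")

context = ("I think the answer is probably Marseille, but my textbook "
           "says the capital of France is Paris.")
print(s2a_answer(context, "What is the capital of France?"))
```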


Machine Learning and Computer Vision Techniques in Continuous Beehive Monitoring Applications: A survey

Bilik, Simon, Zemcik, Tomas, Kratochvila, Lukas, Ricanek, Dominik, Richter, Milos, Zambanini, Sebastian, Horak, Karel

arXiv.org Artificial Intelligence

The wide availability of machine learning and computer vision techniques allows the development of relatively complex monitoring systems in many domains. Beyond the traditional industrial domain, new applications are also appearing in biology and agriculture, including the detection of infections, parasites, and weeds, as well as automated monitoring and early warning systems. This trend is also connected with the introduction of easily accessible hardware and development kits such as the Arduino and Raspberry Pi families. In this paper, we survey 50 existing papers on automated beehive monitoring methods based on computer vision, focusing in particular on pollen and Varroa mite detection together with bee traffic monitoring. Such systems could also be used to monitor honeybee colonies and inspect their health, identifying potentially dangerous states before the situation becomes critical, or to better plan periodic colony inspections and thereby save significant costs. We also analyze research trends in this application field and outline possible directions for future exploration. Our paper is additionally aimed at veterinary and apidology professionals who may not be familiar with machine learning; to introduce them to its possibilities, each family of applications opens with a brief theoretical introduction and the motivation behind its underlying method. We hope that this paper will inspire other scientists to apply machine learning techniques to further applications in beehive monitoring.
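
As a taste of the simplest technique family covered by such surveys, the following dependency-free Python sketch uses frame differencing at a hive entrance as a crude bee-traffic signal; real systems in the surveyed papers rely on full OpenCV pipelines or CNN detectors, and the frames here are synthetic.

```python
# Minimal sketch: frame differencing as a crude bee-traffic activity signal.
def frame_diff_activity(prev, curr, threshold=30):
    """Fraction of pixels whose intensity changed by more than `threshold`."""
    changed = sum(
        abs(p - c) > threshold
        for row_p, row_c in zip(prev, curr)
        for p, c in zip(row_p, row_c)
    )
    return changed / (len(prev) * len(prev[0]))

# Two tiny synthetic grayscale frames; a "bee" (dark blob) moves one pixel.
frame_a = [[200] * 8 for _ in range(6)]
frame_b = [row[:] for row in frame_a]
frame_a[2][2] = frame_a[2][3] = 40   # bee at position 1
frame_b[2][3] = frame_b[2][4] = 40   # bee at position 2

print(f"Activity score: {frame_diff_activity(frame_a, frame_b):.3f}")
```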